1 Goal

Describe variation in spawn timing and how it relates to environmental covariates.

2 Study area and species

This study was conducted in the Middle Fork of the Salmon River (MFSR) in central Idaho (1). The MFSR is a tributary of the Salmon River and is part of the larger Columbia River Basin. The MFSR is home to several species of salmon, including Chinook salmon (Oncorhynchus tshawytscha).

Map of the Middle Fork Salmon River (MFSR) study area showing redd locations used in the analysis (2002-2005) and stream reaches.

Figure 1: Map of the Middle Fork Salmon River (MFSR) study area showing redd locations used in the analysis (2002-2005) and stream reaches.

3 Data prep and inspection

3.1 Spawn timing data

Spawn timing data for Chinook salmon were collected from 2001 to 2005 in the MFSR.

We removed data from 2001, and data from Knapp Creek and Cape Horn Creek, as these sites were not consistently sampled.

Because redds were not observed daily, we inferred spawn dates as the initial date a completed redd was observed.

We joined each redd location to the NHD and assigned them a COMID, analogous to a stream reach. The COMID is used to link the spawn time data with covariate data associated with the stream reach on which it is located.

## # A tibble: 3,016 × 6
##    redd_id COMID    spawn_date stream      year   yday
##      <dbl> <fct>    <date>     <fct>       <fct> <dbl>
##  1    1295 23519365 2002-08-23 Bear Valley 2002    235
##  2    1296 23519365 2002-08-23 Bear Valley 2002    235
##  3    1298 23519319 2002-08-23 Bear Valley 2002    235
##  4    1299 23519319 2002-08-23 Bear Valley 2002    235
##  5    1300 23519319 2002-08-23 Bear Valley 2002    235
##  6    1301 23519317 2002-08-23 Bear Valley 2002    235
##  7    1302 23519317 2002-08-26 Bear Valley 2002    238
##  8    1303 23519317 2002-09-03 Bear Valley 2002    246
##  9    1304 23519317 2002-08-26 Bear Valley 2002    238
## 10    1305 23519317 2002-09-03 Bear Valley 2002    246
## # ℹ 3,006 more rows

The data comprise 3016 redd observation from 8 streams across 4 years. The redds were observed between day 206 and 263.

3.1.1 Variation in spawn timing

Histogram and density of spawn timing data (DOY) for all streams and years.

Figure 2: Histogram and density of spawn timing data (DOY) for all streams and years.

  • mean and median are equal, low SD
  • var < mean (no overdispersion) data are right-skewed, lumpy/multimodal, and not symmetric at least
  • Poisson family is response is count of spawning events per day, Gaussian if model density as continuous

3.1.2 Spawn time variation by year and stream

Spawn time variation by year and stream.

Figure 3: Spawn time variation by year and stream.

3.1.3 Spawn time by COMID and stream

Spawn time by COMID and stream.

Figure 4: Spawn time by COMID and stream.

3.1.4 Spawn time distributions by stream

Spawning phenology of adult Chinook Salmon. In all panels, the black density function represented stream-level spawn timing, while the colored density functions represent the spawn timing of individual years. The dashed vertical purple lines represent the 5th and 95th percentiles of the basin-wide spawn timing.

Figure 5: Spawning phenology of adult Chinook Salmon. In all panels, the black density function represented stream-level spawn timing, while the colored density functions represent the spawn timing of individual years. The dashed vertical purple lines represent the 5th and 95th percentiles of the basin-wide spawn timing.

3.1.5 Cumulative proportions of redds by stream

Cumulative proportion of completed of redds by stream.

(#fig:plot-cum-props, )Cumulative proportion of completed of redds by stream.

3.2 Covariates

To test for environmental factors driving variation in spawn timing, we quantified associations with metrics describing thermal and physical conditions in stream reaches.

We selected covariates based on the following criteria: (1) they are known to influence spawn timing, (2) they are available for all streams, and (3) they are not highly correlated with each other.

Our focal independent variable were:

  • stream temperature (°C) - in-basin effect on how fast fish ripen and commit to spawning
  • stream discharge (cms) - out of basin year effect on when fish initially make it to spawn grounds
  • elevation (m above sea level)
  • stream gradient (slope)

3.2.1 Stream temperature

We used modeled daily average stream temperature values predicted at the stream segment (COMID) scale (Siegel et al. 2023; available from https://zenodo.org/records/8174951). These data were downloaded and filtered to 2001-2005 and for the MFSR.

3.2.1.1 Summarized thermal regimes by stream
Modeled thermal regimes (2001-2005) for MFSR tributaries. Black line = mean, Red ribbon = 40 - 60th percentiles, Grey ribbon = full range.

Figure 6: Modeled thermal regimes (2001-2005) for MFSR tributaries. Black line = mean, Red ribbon = 40 - 60th percentiles, Grey ribbon = full range.

3.2.1.2 Stream temperature metrics

We calculated metrics relative to a COMID a redd was constructed on and a redd completion date; before, after, and spanning the date.

For example, temp_30_before is the average temperature for a COMID where a redd was constructed calculated over the previous 30 days. We did this for 30, 60, and 90 days.

We also calculated a time invariant metric relative to a fixed date across all years that was chosen to represent an initial spawning window, e.g., August 1.

The time invariant and after metrics were omitted from further consideration as preliminary data exploration showed weak if any relationship with spawn timing.

## # A tibble: 6 × 6
##   redd_id    COMID spawn_date temp_30 temp_60 temp_90
##     <dbl>    <dbl> <date>       <dbl>   <dbl>   <dbl>
## 1    1000 23519529 2001-08-31    12.6    12.3    11.2
## 2    1001 23519531 2001-08-31    12.7    12.3    11.3
## 3    1002 23519531 2001-08-31    12.7    12.3    11.3
## 4    1003 23519531 2001-08-31    12.7    12.3    11.3
## 5    1004 23519531 2001-08-31    12.7    12.3    11.3
## 6    1005 23519531 2001-08-31    12.7    12.3    11.3

3.2.2 Discharge (streamflow)

Stream flow data were compiled from a single USGS Gage lower in the watershed (MF Salmon River at MF Lodge NR Yellow Pine ID - 13309220). Becuase flow data are not COMID- or stream-specific, it makes sense to think about and represent flow as an out-of-basin year effect that determines when adults make it back to the MFSR and initially onto the spawning grounds.

3.2.2.1 Inter-annual variability
Inter-annual variability in daily discharge (cfs) at MF Lodge USGS Gage 13309220.

Figure 7: Inter-annual variability in daily discharge (cfs) at MF Lodge USGS Gage 13309220.

3.2.2.2 Flow metrics

We calculated flow metrics relative to a COMID a redd was constructed on and a redd completion date; before, after, and spanning the date.

For example, temp_30_before is the average temperature for a COMID where a redd was constructed calculated over the previous 30 days. We did this for 30, 60, and 90 days.

The spanning and after metrics were omitted from further consideration as preliminary data exploration showed weak if any relationship with spawn timing.

## # A tibble: 124 × 4
##    spawn_date flow_30 flow_60 flow_90
##    <date>       <dbl>   <dbl>   <dbl>
##  1 2001-08-31    363.    449.    628.
##  2 2001-08-27    394.    475     683.
##  3 2001-08-28    390.    467.    668.
##  4 2001-08-29    385.    461.    653.
##  5 2002-08-23    651.    997.   1993.
##  6 2002-08-26    624.    912.   1882.
##  7 2002-09-03    584.    766.   1412.
##  8 2002-08-19    697.   1140.   2151.
##  9 2002-08-20    684.   1106.   2099.
## 10 2002-08-24    642.    967.   1960.
## # ℹ 114 more rows

3.2.3 Elevation and stream gradient (slope)

Elevation and stream slope data were available at the COMID (stream reach) scale from the NHD.

## # A tibble: 231,807 × 3
##       COMID  slope mean_elevation
##       <int>  <dbl>          <dbl>
##  1 24202021 0.0134          1231.
##  2 24202023 0.112           1024.
##  3 24202033 0.0633          1603.
##  4 24202025 0.177           1893.
##  5 24202027 0.109           1080.
##  6 24202031 0.107           1213.
##  7 24202035 0.141           1037.
##  8 24202037 0.0583          1444.
##  9 24202039 0.139           1370.
## 10 24202041 0.130           1083.
## # ℹ 231,797 more rows

3.2.4 Combine datasets

## Rows: 3,013
## Columns: 14
## $ redd_id        <dbl> 1295, 1296, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1…
## $ COMID          <fct> 23519365, 23519365, 23519319, 23519319, 23519319, 23519…
## $ spawn_date     <date> 2002-08-23, 2002-08-23, 2002-08-23, 2002-08-23, 2002-0…
## $ stream         <fct> Bear Valley, Bear Valley, Bear Valley, Bear Valley, Bea…
## $ year           <fct> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2…
## $ yday           <dbl> 235, 235, 235, 235, 235, 235, 238, 246, 238, 246, 246, …
## $ flow_30        <dbl> 650.8333, 650.8333, 650.8333, 650.8333, 650.8333, 650.8…
## $ flow_60        <dbl> 997.4333, 997.4333, 997.4333, 997.4333, 997.4333, 997.4…
## $ flow_90        <dbl> 1992.733, 1992.733, 1992.733, 1992.733, 1992.733, 1992.…
## $ temp_30        <dbl> 12.70318, 12.70318, 14.02703, 14.02703, 14.02703, 13.78…
## $ temp_60        <dbl> 12.87280, 12.87280, 14.20337, 14.20337, 14.20337, 13.95…
## $ temp_90        <dbl> 11.07830, 11.07830, 12.15472, 12.15472, 12.15472, 11.92…
## $ slope          <dbl> 0.00299158, 0.00299158, 0.00171806, 0.00171806, 0.00171…
## $ mean_elevation <dbl> 1951.730, 1951.730, 1946.735, 1946.735, 1946.735, 1944.…

4 Bivariate relationships with covariates

  • temp_30 = no relationship, drop
  • temp_60 = good non-linear, drop for temp_90?
  • temp_90 = better non-linear
  • flow_30 and flow_60 = similar decaying exponential
  • flow_90 = inflections, interesting grouping really spreads out, year effect
  • SLOPE = no relationship, drop
  • mean_elevation = slightly negative linear?

5 Check for co-linearity

model_data |> 
  select(temp_30, temp_60, temp_90, flow_30, flow_60, flow_90, mean_elevation) |> 
  GGally::ggpairs()

  • There is strong colinearity among temp variable. Retaining temp_90 as it has the strongest relationship.
  • flow_30 and flow_60 are highly correlated with flow_90.
  • Keep flow_90 as it has the strongest relationship.
  • elevation good to keep

Check VIFs with and without flow_30 and flow_60:

car::vif(lm(yday ~ flow_30 + flow_60 + flow_90 + temp_90 + mean_elevation + stream + year, data = model_data))
##                    GVIF Df GVIF^(1/(2*Df))
## flow_30        54.18575  1        7.361097
## flow_60        42.25228  1        6.500176
## flow_90        13.47150  1        3.670354
## temp_90        15.37376  1        3.920938
## mean_elevation 21.09913  1        4.593379
## stream         64.19481  7        1.346192
## year           54.26889  3        1.945771
car::vif(lm(yday ~ flow_90 + temp_90 + mean_elevation + stream + year, data = model_data))
##                     GVIF Df GVIF^(1/(2*Df))
## flow_90         9.605784  1        3.099320
## temp_90        14.146143  1        3.761136
## mean_elevation 20.418174  1        4.518647
## stream         53.393667  7        1.328594
## year            8.676807  3        1.433486
  • more reason to drop flow_30 and flow_60

6 Closer look at covariates

6.1 Temp_90

Spawn timing vs. 90-day mean temperature pre spawn.

(#fig:plot-temp_90)Spawn timing vs. 90-day mean temperature pre spawn.

  • clear positive relationship, certainly some non-linearity
  • stream- and year-level variation (interactions)

Check simple models for temp to examine functional structure and compare with AIC:

m1 <- lm(yday ~ temp_90, data = model_data)
m2 <- lm(yday ~ I(temp_90^2), data = model_data)
m3 <- lm(yday ~ temp_90 + year, data = model_data)
m4 <- lm(yday ~ temp_90 + year + stream, data = model_data)
m5 <- lm(yday ~ temp_90 * stream, data = model_data)
m6 <- lm(yday ~ temp_90 * stream + year, data = model_data)
m7 <- lm(yday ~ temp_90 * year + stream, data = model_data)
m8 <- lm(yday ~ temp_90 * stream + I(temp_90^2), data = model_data)
m9 <- lm(yday ~ temp_90 * stream + year + I(temp_90^2), data = model_data)
df AIC delta
m9 21 15602.84 0.0000
m6 20 15846.65 243.8107
m7 16 16110.32 507.4761
m8 18 16192.42 589.5746
m4 13 16245.22 642.3722
m5 17 16547.05 944.2027
m3 6 17613.48 2010.6394
m1 3 17983.29 2380.4428
m2 3 18211.19 2608.3443
  • best is m9: yday ~ temp_90 * stream + year + I(temp_90^2)
    • linear and quadratic temp effect, temp x stream interaction, additive year effect
  • next best is m6: yday ~ temp_90 * stream + year (no quadratic)

Let’s plot the predictions from m9 to see how well it fits the data.

Predicted spawn timing by stream and year.

(#fig:temp_90_preds)Predicted spawn timing by stream and year.

Observations from the plot

  • Strong curvature captured overall
  • Panels show increasing concave-down trend

Notable spatial breaks in some streams

  • Big and Camas
    • Both show discontinuities or grouping in temp ranges with clear vertical offsets
    • Possibly distinct COMIDs (upper vs. lower Big or Camas)
    • Or abrupt temperature transitions across years or geomorphic controls (elevation?)
  • Loon and Marsh
    • Patterns are tighter but still show micro-clustering → likely subtle COMID-level shifts.

Interpretation

  • Quadratic model captures macro-scale thermal responses well
  • But within-stream variation suggests that local (COMID-scale) differences are meaningful.
  • Stream-wide models may oversmooth important spatial heterogeneity.
  • Finer-scale modeling (e.g., random intercepts for COMID) could account for these offsets.

Visual inspection of stream- and year-specific fits revealed curvature in temperature–spawn timing relationships, with apparent within-stream variation in baseline timing across COMIDs. This suggested spatial heterogeneity not explained by stream-level fixed effects alone.

To account for this, we evaluated the addition of random intercepts for COMID with both top models (m6 and m9).

m6_re <- lmer(yday ~ temp_90 * stream + year + (1|COMID), data = model_data)
m9_re <- lmer(yday ~ temp_90 * stream + year + I(temp_90^2) + (1|COMID), data = model_data)
df AIC delta
m9_re 22 13011.83 0.00000
m6_re 21 13050.52 38.69329
m9 21 15602.84 2591.01254
m6 20 15846.65 2834.82322
  1. Effect of COMID Random Intercepts
  • Comparing m6 vs m6_re (same fixed effects, RE added): ΔAIC = 2796.1
  • Massive improvement → shows strong unmodeled structure at the COMID level
  • Including (1 | COMID) is important
  1. Effect of Quadratic Term
  • Comparing m6_re vs m9_re: ΔAIC = 38.7
  • evidence that the quadratic term explains real signal beyond COMID-level differences
  • The curvature isn’t just compensating for omitted structure — it’s meaningful even when that structure is accounted for
  1. Quadratic w/o RE vs Linear w/ RE
  • m9 (quadratic only) still has a much worse AIC than m6_re (linear w/ RE): 15602 vs 13050 → ΔAIC ≈ 2552
  • This shows that random effects contribute far more than nonlinearity alone
  • But combining them (in m9_re) gives best performance

Takeaways

  • Random intercepts account for baseline spawn timing differences between reaches
  • the quadratic temperature effect describes a real, nonlinear thermal response
  • Possibly related to metabolic thresholds, photoperiod interactions, or non-additive developmental cues
  • The combined model (m9_re) is supported: reflects spatial variation and nonlinear phenological drivers

Plot predictions with random intercepts:

Predicted spawn timing by stream and year with random intercepts.

(#fig:temp_90_preds-re)Predicted spawn timing by stream and year with random intercepts.

What this shows:

  • Vertical offsets in curves within each stream = COMID-specific random intercepts
  • Parallel lines = all COMIDs in a stream respond similarly to temperature, but start at different baselines
  • The inclusion of (1 | COMID) is doing its job — capturing consistent differences in spawn timing offset between reaches

Streams with wide COMID spread:

  • Big, Camas, and Bear Valley stand out
  • Likely reflect real ecological or geomorphic differences
    • Distance from confluence
    • Groundwater influence
    • Channel type, etc.
    • elevation?

Streams with tight COMID clustering:

  • Sulphur, Marsh, Loon show tighter fits
  • May indicate:
    • More homogeneous habitat conditions
    • Shorter stream length
    • Less vertical habitat diversity

The simplified model with COMID random intercepts is capturing reach-level variation without overwhelming the model. This is a strong candidate structure to carry forward or use as a baseline for comparing against more complex interaction models.

Thoughts from Dan: Temporal compression in lower reaches?

Dan’s field observation: fish in warmer, lower reaches spawn later and over a shorter window.

Upstream vs. downstream dynamics in Big, Camas, Loon, etc. likely manifest in:

  • Vertical intercept shifts (already captured by (1 | COMID))
  • But also differences in slope (i.e., how they respond to temperature)
  • the latter may suggest a true linear effect on much smaller spatial scales

Statistical implication:

Should we add COMID slopes?

The next logical step is to consider random slopes. This lets each COMID have its own slope for temp_90, capturing different rates of response to temperature.

But…

Overfitting risk! Here’s the situation:

Factor Status
COMID 108 levels
Many have < 5 redds
Some have only 1 redd

Random slopes are data-hungry. With sparse COMID-level samples, this model will:

  • Almost certainly be singular
  • Overfit slope noise as signal
  • Widen uncertainty bands a lot

So — stick with random intercepts, and revisit slopes later only if:

  • We restrict to high-sample COMIDs
  • Or pool into zones (e.g., upper/lower) for slope comparisons

Conclusion

I think we will proceed and compare the full interaction model with and without the quadratic term. We’ll Stick with (1 | COMID) for now — it gets the key structure without overfitting. And flag a future option to explore grouped random slopes, or COMID × temp_90 interaction if biologically and statistically justified.

6.2 Flow_90

Spawn timing vs. 90-day mean discharge pre spawn at MF Lodge.

(#fig:plot-flow_90)Spawn timing vs. 90-day mean discharge pre spawn at MF Lodge.

  • clear negative relationship, higher 90-day mean flows are associated with earlier spawn timing
  • relationship differs by year: 2003 linear but flattening or non-linear at higher flows
  • suggest different slopes or curves across years (year specific responses, but just intercepts)
  • stream-level variation as well, different intercepts and slopes
  • this is really a year effect with variation by stream and comid (local)

Check simple models for flow to examine functional structure and comapare with AIC:

m1 <- lm(yday ~ flow_90, data = model_data)
m2 <- lm(yday ~ I(flow_90^2), data = model_data)
m3 <- lm(yday ~ flow_90 + year, data = model_data)
m4 <- lm(yday ~ flow_90 + year + stream, data = model_data)
m5 <- lm(yday ~ flow_90 * stream, data = model_data)
m6 <- lm(yday ~ flow_90 * stream + year, data = model_data)
m7 <- lm(yday ~ flow_90 * year + stream, data = model_data)
m8 <- lm(yday ~ flow_90 * stream + I(flow_90^2), data = model_data)
m9 <- lm(yday ~ flow_90 * stream + year + I(flow_90^2), data = model_data)
df AIC delta
m7 16 9870.368 0.000
m9 21 12875.046 3004.678
m6 20 14100.258 4229.890
m4 13 14682.235 4811.867
m3 6 15098.512 5228.143
m8 18 17381.664 7511.296
m5 17 17618.492 7748.124
m1 3 18662.693 8792.325
m2 3 19039.519 9169.151
  • best is m7: yday ~ flow_90 * year + stream
  • flow x year interaction, additive stream effect

Let’s plot the predictions from m7 to see how well it fits the data.

Predicted spawn timing by year.

(#fig:flow_90_preds_year)Predicted spawn timing by year.

Predicted spawn timing by stream and year.

(#fig:flow_90_preds_all)Predicted spawn timing by stream and year.

  1. Strong, consistent negative relationship across panels
  • In every stream and year combo, yday decreases with increasing flow_90
  • This is a strong, consistent pattern across all years and streams.Suggests that higher pre-spawn flow leads to earlier spawning — ecologically plausible if higher flows coincide with snowmelt recession or cooler temps.
  1. Variation in slope
  • Slopes diverge across years, meaning the flow_90 × year interaction may be statistically retained and meaningfully variable.
  • This suggests a non-stationary flow effect over the four years.
  1. Flow is clearly confounded with year
  • Flow values in each year are clearly distinct
  • flow_90 behaves like a proxy for year, and its explanatory power may be redundant with year.
  • flow_90 is still a basin-wide measure, not stream-specific
  • Even if the relationship looks strong here, it may artificially inflate R² because it imposes a covariate that’s similar across streams for a given redd’s spawn date.
  • Could explain strong apparent fits without truly representing local flow conditions.

Ecological Interpretation

  • Spawn timing is earlier in years or settings with higher pre-spawn flow” — this makes sense, especially if:
    • Higher flows coincide with spring snowmelt → fish advance spawning
    • Lower flows delay environmental cues

But this doesn’t necessarily mean site-specific flow is a driver. Just that flow_90 tracks broader seasonal variation (i.e., interannual hydrology), already captured by year.

Flow could be valid in a year-based model, or as a year-level summary, but not site-level unless spatially resolved.

6.2.1 To include or not to include flow_90?

Including flow_90 could introduce spurious precision and possibly overfitting. Why flow_90 might be problematic:

  1. Not spatially resolved
  • We are modeling spawn timing at the redd level (COMID/stream)
  • But flow_90 is calculated from a single downstream gauge, and applied to all redds
  • This assumes flow conditions are identical across all sites, which is rarely true in a branching stream network
  • Including it gives the illusion of spatially resolved variation that isn’t there
  1. Likely correlated with year
  • Since flow_90 varies mostly across years, (albeit slightly with streams), it is strongly confounded with year
  • Any flow-related signal is probably already captured by your year fixed effect
  • Including both flow_90 and year risks collinearity, and may produce misleading inferences
  1. Spawn-time aligned flow ≠ experienced flow
  • While flow_90 is aligned to each redd’s spawn date, it still reflects a lower-basin gauge, not the actual hydrologic conditions experienced at the redd site
  • So it might be precisely wrong — aligned in time but irrelevant in space

Bottom Line:

Unless we have site-specific or spatially disaggregated flow data, flow_90 is probably not a valid covariate for redd-level models.

Including it may:

  • Overfit due to noise or pseudo-replication
  • Complicate interpretation (e.g., why one stream “responds” to flows measured elsewhere?)
  • Mask true year or site effects

Recommendation:

Drop flow_90 from model (or at most, keep it as a year-level covariate if we summarize it to a single annual value for all observations)

Text for ms:

Although we initially considered including 90-day mean streamflow (flow_90) as a predictor of spawn timing, this variable was ultimately excluded due to concerns about ecological validity and model overfitting. Streamflow data were derived from a single downstream USGS gauge and did not capture spatial variation across the study streams or reaches. Moreover, because flow_90 was closely aligned with year, it introduced strong collinearity with the year effect and risked attributing site-level variation to flow patterns not actually experienced by individual redds. As such, we excluded flow_90 to avoid misleading inference.

7 Final dataset and scale covariates

Final dataset includes:

  • response: spawn time (yday), continuous
  • grouping variables: comid, stream, year
  • covariates: temp_90, elevation
## # A tibble: 3,013 × 7
##     yday COMID    stream      year  temp_90 mean_elevation  slope
##    <dbl> <fct>    <fct>       <fct>   <dbl>          <dbl>  <dbl>
##  1   235 23519365 Bear Valley 2002   -0.316          0.717 -0.519
##  2   235 23519365 Bear Valley 2002   -0.316          0.717 -0.519
##  3   235 23519319 Bear Valley 2002    0.471          0.694 -0.711
##  4   235 23519319 Bear Valley 2002    0.471          0.694 -0.711
##  5   235 23519319 Bear Valley 2002    0.471          0.694 -0.711
##  6   235 23519317 Bear Valley 2002    0.303          0.683 -0.539
##  7   238 23519317 Bear Valley 2002    0.468          0.683 -0.539
##  8   246 23519317 Bear Valley 2002    0.828          0.683 -0.539
##  9   238 23519317 Bear Valley 2002    0.468          0.683 -0.539
## 10   246 23519317 Bear Valley 2002    0.828          0.683 -0.539
## # ℹ 3,003 more rows

8 Model specification

8.1 Fixed and random effects

We should only include a grouping variable as both fixef and ranef when you want to model differenct aspects of the effect. E.g., stream as fixed to compared average effects by stream, but as ranefs to account for non-independence of obs within each stream but not to compare average effects. Include as both when you want population-level estimates AND when you expect stream-level random variation around those means. Do not include both when intercepts are redundant (~ stream + (1| stream)), or too few levels for the random effect to be estimated (e.g., <5 levels).

Variable Fixed? Random? Why
COMID No Yes (but must check data availability among levels; likely sparse) Not estimating individual COMID effects, just accounting for correlation.
Stream Yes Maybe (only if random slope or complex structure) May want to estimate differences and account for grouping.
Year Yes (comparing years) Maybe (account for inter-annual variability) Use one or the other depending on goal.

In this case, it makes sense to include year as fixed because:

  • only has 4 levels - so a ranef would estimate variance poorly and shrink aggressively
  • year captures unmeasured inter annual variability (snow pack, flow, temp anomalies all wrapped in)
  • we want to compare among years given we only have 4 years of data, hard to extrapolate

Let’s interrogate the grouping structure within the data to make a final decision about COMIDs.

8.2 Grouping structure of data

stream n_COMIDs
Big 21
Loon 19
Camas 16
Marsh 13
Bear Valley 11
Sulphur 11
Beaver 7
Elk 6
Number of observations (redds) per COMID.

Figure 8: Number of observations (redds) per COMID.

8.2.1 Summary

Are the enough COMIDs per stream to consider stream/COMID nested random effects? Not really…

Are groups well sampled? (Do most COMIDs have >1-2 redds?) No. 23 COMIDs have <5 redds (25.9615385%), 13 have <= 2. (12.5%). With <5 obs/level, variance estimates become unstable -> overfit and absorb noise (low AIC / high R2) -> singular fits.

Are year or stream-level random effects justifiable? We want to compare streams and years in this dataset, not generalize beyond them, so we should use fixed effects. Further, stream are known, of interest, and were not randomly sampled. Including (1 | stream) would be statistically redundant (stream intercepts would be double-modeled) and would lead to misleading AIC comparisons.

Random slopes?. The needs multiple obs per group across a range of slope variable (temp_90), enough replication to estimate variation in slopes, not just intercepts. Need ~8+ obs per group spread across the covariate. We could do this for stream or year. Prob not.

So, we should use a simple linear model with no random effects. This allows for interpretable results, no overfitting from random effects, and is a reasonable approach given the design.

If we did try (1 | COMID), expect singularity, low variance estimates for COMID, and predictions will likely miss the groups means.

That said, as shown in Temp_90, we probably should be including COMID as a random effect. So we many have to simple the model (e.g., omit elevation), to ovoid over fitting.

9 Model fitting and comparison

9.1 Simple LM

First we’ll fit simple linear models to get a sense of the data and the relationships. Adding new interactions each time.

# All combinations of 1–5 fixed effects: temp_90, stream, year, mean_elevation, SLOPE
# 31 total
m1 <- lm(yday ~ temp_90, data = model_data)
m2 <- lm(yday ~ stream, data = model_data)
m3 <- lm(yday ~ year, data = model_data)
m4 <- lm(yday ~ mean_elevation, data = model_data)
m5 <- lm(yday ~ slope, data = model_data)

m6  <- lm(yday ~ temp_90 + stream, data = model_data)
m7  <- lm(yday ~ temp_90 + year, data = model_data)
m8  <- lm(yday ~ temp_90 + mean_elevation, data = model_data)
m9  <- lm(yday ~ temp_90 + slope, data = model_data)
m10 <- lm(yday ~ stream + year, data = model_data)
m11 <- lm(yday ~ stream + mean_elevation, data = model_data)
m12 <- lm(yday ~ stream + slope, data = model_data)
m13 <- lm(yday ~ year + mean_elevation, data = model_data)
m14 <- lm(yday ~ year + slope, data = model_data)
m15 <- lm(yday ~ mean_elevation + slope, data = model_data)

m16 <- lm(yday ~ temp_90 + stream + year, data = model_data)
m17 <- lm(yday ~ temp_90 + stream + mean_elevation, data = model_data)
m18 <- lm(yday ~ temp_90 + stream + slope, data = model_data)
m19 <- lm(yday ~ temp_90 + year + mean_elevation, data = model_data)
m20 <- lm(yday ~ temp_90 + year + slope, data = model_data)
m21 <- lm(yday ~ temp_90 + mean_elevation + slope, data = model_data)
m22 <- lm(yday ~ stream + year + mean_elevation, data = model_data)
m23 <- lm(yday ~ stream + year + slope, data = model_data)
m24 <- lm(yday ~ stream + mean_elevation + slope, data = model_data)
m25 <- lm(yday ~ year + mean_elevation + slope, data = model_data)

m26 <- lm(yday ~ temp_90 + stream + year + mean_elevation, data = model_data)
m27 <- lm(yday ~ temp_90 + stream + year + slope, data = model_data)
m28 <- lm(yday ~ temp_90 + stream + mean_elevation + slope, data = model_data)
m29 <- lm(yday ~ temp_90 + year + mean_elevation + slope, data = model_data)
m30 <- lm(yday ~ stream + year + mean_elevation + slope, data = model_data)

m31 <- lm(yday ~ temp_90 + stream + year + mean_elevation + slope, data = model_data)
Model df AIC delta
m31 15 15849.91 0.00000
m26 14 15933.83 83.91617
m27 14 16190.84 340.92786
m16 13 16245.22 395.30513
m28 12 16950.14 1100.23075
m18 11 16973.74 1123.82891
m17 11 16993.81 1143.89804
m6 10 17012.83 1162.92236
m29 8 17482.46 1632.55077
m19 7 17568.43 1718.51997
Rank Model Formula (inferred) df AIC ΔAIC
1 m31 temp_90 + stream + year + mean_elevation + SLOPE 15 15849.91 0.00
2 m26 temp_90 + stream + year + mean_elevation 14 15933.83 83.91
3 m27 temp_90 + stream + year + SLOPE 14 16190.84 340.93
4 m16 temp_90 + stream + year 13 16245.22 395.31
5 m28 temp_90 + stream + mean_elevation + SLOPE 13 16950.14 1100.23

Interpretation

Model m31 (best) - Contains all five covariates - Suggests that each adds unique, non-redundant signal - Likely a solid benchmark, but might be at risk of overfitting or including covariates of weak effect

🔻 Model m26 - Drops slope → minimal ΔAIC (Δ = 84) - Suggests slope may not be that influential when year and elevation are included

🔻 Model m27 - Drops mean_elevation instead → ΔAIC = 341 - Implies elevation likely provides more explanatory power than slope in this model context

🔻 Model m16 - Drops both mean_elevation and slope - Still relatively good (Δ = 395), meaning that temp_90 + stream + year explains the bulk of variation

🔻 Model m28 - general model as suggest by Dan - Omits year, includes everything else → ΔAIC = 1100 - Big drop confirms that year is critically important, likely absorbing interannual hydrology, climate variation, or escapement.

Takeaways

  • year and stream are foundational — they’re in all top models.
  • temp_90 is always present — unsurprisingly.
  • Mean elevation is more helpful than slope, but may interact or confound in full models (as you saw earlier).
  • slope adds little alone but may act synergistically in certain combinations.

Recommendation:

  • Retain year, stream, temp_90 in all further models.
  • Test dropping slope from top models — see if model simplification is justified.
  • Explore parameter estimates from m31 to see if elevation and slope are meaningful (i.e., significant and consistent).

Look at the model estimates for top 5 models:

  Full model No slope No elevation No elevation or slope No year
Predictors Estimates CI p Estimates CI p Estimates CI p Estimates CI p Estimates CI p
(Intercept) 111.42 104.62 – 118.23 <0.001 115.37 108.52 – 122.22 <0.001 175.79 174.21 – 177.37 <0.001 177.23 175.68 – 178.77 <0.001 162.89 155.55 – 170.22 <0.001
temp 90 7.09 6.88 – 7.31 <0.001 6.92 6.71 – 7.14 <0.001 5.36 5.23 – 5.48 <0.001 5.26 5.13 – 5.39 <0.001 5.32 5.09 – 5.55 <0.001
stream [Beaver] 9.57 8.95 – 10.19 <0.001 9.78 9.15 – 10.41 <0.001 7.93 7.30 – 8.56 <0.001 8.16 7.53 – 8.79 <0.001 6.40 5.69 – 7.11 <0.001
stream [Big] 10.78 9.60 – 11.96 <0.001 11.21 10.01 – 12.40 <0.001 0.19 -0.28 – 0.66 0.434 0.87 0.43 – 1.31 <0.001 3.03 1.72 – 4.35 <0.001
stream [Camas] 8.17 7.28 – 9.06 <0.001 9.01 8.13 – 9.90 <0.001 1.22 0.66 – 1.78 <0.001 2.15 1.64 – 2.66 <0.001 3.84 2.82 – 4.87 <0.001
stream [Elk] 1.58 1.09 – 2.06 <0.001 1.26 0.77 – 1.75 <0.001 -0.01 -0.49 – 0.48 0.983 -0.23 -0.71 – 0.25 0.353 0.18 -0.39 – 0.76 0.533
stream [Loon] 12.20 11.29 – 13.12 <0.001 12.19 11.26 – 13.11 <0.001 4.99 4.43 – 5.55 <0.001 5.19 4.63 – 5.76 <0.001 8.13 7.07 – 9.19 <0.001
stream [Marsh] 4.48 4.00 – 4.97 <0.001 4.94 4.45 – 5.42 <0.001 5.82 5.32 – 6.31 <0.001 6.17 5.68 – 6.66 <0.001 5.09 4.51 – 5.67 <0.001
stream [Sulphur] 13.99 12.91 – 15.06 <0.001 13.43 12.34 – 14.52 <0.001 5.23 4.61 – 5.86 <0.001 5.02 4.39 – 5.65 <0.001 5.28 4.13 – 6.44 <0.001
year [2003] -3.86 -4.17 – -3.55 <0.001 -3.85 -4.16 – -3.54 <0.001 -2.86 -3.16 – -2.55 <0.001 -2.88 -3.19 – -2.57 <0.001
year [2004] 2.18 1.79 – 2.57 <0.001 2.13 1.73 – 2.52 <0.001 1.74 1.33 – 2.15 <0.001 1.71 1.29 – 2.12 <0.001
year [2005] 3.94 3.44 – 4.45 <0.001 3.89 3.38 – 4.40 <0.001 3.92 3.39 – 4.45 <0.001 3.88 3.34 – 4.42 <0.001
mean elevation 0.02 0.02 – 0.02 <0.001 0.02 0.02 – 0.02 <0.001 0.01 0.00 – 0.01 <0.001
slope 109.56 86.49 – 132.62 <0.001 93.47 69.12 – 117.82 <0.001 95.58 67.90 – 123.25 <0.001
Observations 3013 3013 3013 3013 3013
R2 / R2 adjusted 0.840 / 0.839 0.835 / 0.835 0.821 / 0.820 0.817 / 0.817 0.769 / 0.768
  1. Consistent Predictors
  • temp_90, stream, and year are consistently significant across all top models.
  • Their estimates are stable and interpretable.These should be locked in as core covariates.
  1. Elevation vs. Slope
  • mean_elevation has:
    • A modestly sized effect
    • Marginal p-values (some < 0.05, some ~0.1)
  • SLOPE is more variable:
    • Sometimes barely significant
    • Sometimes not at all
  • removing SLOPE from the top model (m31 → m26) causes only a moderate AIC jump (ΔAIC ~84)

Interpretation: Elevation is probably doing more work than slope, but neither may be critical once stream and year are accounted for.

  1. Watch for redundancy
  • In m31, both stream and elevation are included.
  • If elevation is mostly captured by stream identity, it may be redundant.

Recommendations - Retain: temp_90, stream, year - Evaluate: - Keep mean_elevation if effect size is consistent and interpretable - Drop SLOPE unless it provides a strong ecological rationale

Next: - Try a version of the best model without elevation and slope, or just elevation, to test sensitivity - Proceed to adding the quadratic term (I(temp_90^2)) - Then test COMID as a random intercept - test interactions

Compare targeted models first, Why?

  • We’ve already identified that temp_90, stream, and year are critical
  • elevation and slope are questionable in explanatory value
  • Including interactions too soon could:
    • Mask whether those weaker covariates matter
    • Inflate model complexity before locking in the core structure

What to do:

  • Compare:
  • temp_90 + stream + year (baseline)
  • Add mean_elevation and/or SLOPE
  • Then add I(temp_90^2)
  • Then add (1 | COMID)

This gives you a clean sense of model structure before going into interactions.

9.2 Targeted model comparison

df AIC delta
mod_quad_re 15 13452.18 0.000
mod_quad_elev 15 15493.04 2040.858
mod_quad 14 15791.11 2338.923
mod_elev 14 15933.83 2481.644
mod_base 13 16245.22 2793.033
Model Formula Concept df AIC ΔAIC
mod_quad_re `temp_90 + I(temp_90^2) + stream + year + (1 COMID) 15 13452.2
mod_quad_elev temp_90 + I(temp_90^2) + stream + year + elevation 15 15493.0 2040.9
mod_quad temp_90 + I(temp_90^2) + stream + year 14 15791.1 2338.9
mod_elev temp_90 + stream + year + elevation 14 15933.8 2481.6
mod_base temp_90 + stream + year 13 16245.2 2793.0

Interpretation - mod_quad_re (winner) - Massive AIC improvement — >2000 points — over any model lacking COMID as a random effect - Also supports the value of the quadratic temperature term - This is clear front-runner

mod_quad_elev vs mod_quad - Elevation does slightly improve fit when random effects are excluded (ΔAIC ≈ 140) - But when (1 | COMID) is included, the value of elevation vanishes (i.e., you never tested mod_quad_re + elevation, which is telling) - Suggests elevation is already absorbed by COMID-level differences — consistent with earlier findings

Conclusion - Keep: temp_90, I(temp_90^2), stream, year, and (1 | COMID) - Drop: mean_elevation (already accounted for in COMID) - The winning model (mod_quad_re) is simple, interpretable, and statistically robust

9.3 interactions

df AIC delta
mod_interact3 25 11673.57 0.000
mod_interact2 18 12109.61 436.041
mod_interact1 22 13020.03 1346.456
mod_quad_re 15 13452.18 1778.614
Model Formula Concept df AIC ΔAIC
mod_interact3 `temp_90 * stream + temp_90 * year + I(temp_90^2) + (1 COMID)` 25 11673.6 0.0
mod_interact2 `temp_90 * year + stream + I(temp_90^2) + (1 COMID)` 18 12109.6 436.0
mod_interact1 `temp_90 * stream + year + I(temp_90^2) + (1 COMID)` 22 13020.0 1346.5
mod_quad_re `temp_90 + I(temp_90^2) + stream + year + (1 COMID)` 15 13452.2 1778.6

Best model: mod_interact3 - Including both temp_90 × stream and temp_90 × year yields dramatic AIC improvement (~1779 units better than additive model) - Suggests that both spatial and temporal variation in temperature sensitivity are biologically meaningful

Likely reflects differences in: - Local stream-specific phenology (via temp_90 × stream) - Year-to-year climate anomalies and flow/thermal dynamics (via temp_90 × year)

mod_interact2 better than mod_interact1 - Interannual variation (year) appears more influential than streamwise variation in temp_90 response - But stream still matters — together, they outperform either alone

##                       model      AIC        R2       ICC
## mod_quad_re     mod_quad_re 13452.18 0.7413331 0.9220937
## mod_interact1 mod_interact1 13020.03 0.7250133 0.9459792
## mod_interact2 mod_interact2 12109.61 0.6730601 0.9771622
## mod_interact3 mod_interact3 11673.57 0.6705336 0.9825959

As expected, the interactions are overfitting.

So What’s Going On? 1. Low AIC but bad predictions = likely overfitting - AIC rewards goodness-of-fit but penalizes complexity — and the interaction model may be “winning” by fitting spurious patterns (e.g. small temp ranges in some groups leading to flipped quads).

  1. Nonlinear fits breaking down in interactions
  • The quadratic term interacts awkwardly with groups having narrow temp ranges (common issue with poly() or I(temp^2) + factor interaction)
  1. R² > AIC when interpretability matters
  • Since we care about understanding spawn timing across space/time, R² + prediction realism > pure AIC win

Recommendation - Stick with mod_quad_re - Excellent fit (highest R²), interpretable, smoother predictions - AIC difference (~1778) is likely an artifact of interaction complexity - Most biologically plausible model

→ Suggest making this your primary model

Final model predictions.

(#fig:plot_final-1)Final model predictions.

Final model predictions.

(#fig:plot_final-2)Final model predictions.

–>